Recent focus in fine-grained sketch-based image retrieval (FG-SBIR) has shifted towards generalising a model to new categories without any training data. In the real world, however, a trained FG-SBIR model is often applied to both new categories and different human sketchers, i.e., different drawing styles. Although this complicates the generalisation problem, fortunately a handful of examples are typically available, enabling the model to adapt to the new category/style. In this paper, we offer a novel perspective: instead of asking for a model that generalises, we advocate one that quickly adapts, with just a few samples at test time (in a few-shot manner). To solve this new problem, we introduce a novel model-agnostic meta-learning (MAML)-based framework with several key modifications: (1) As a retrieval task with a margin-based contrastive loss, we simplify the MAML training in the inner loop to make it more stable and tractable. (2) The margin of our contrastive loss is also meta-learned along with the rest of the model. (3) Three additional regularisation losses are introduced in the outer loop to make the meta-learned FG-SBIR model more effective for category/style adaptation. Extensive experiments on public datasets show significant gains over generalisation- and zero-shot-based approaches, as well as several strong few-shot baselines.
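To make the mechanism in (2) concrete, a minimal sketch of a margin-based triplet/contrastive loss whose margin is itself a learnable (and hence meta-learnable) parameter might look as follows; the module name and the softplus parameterisation of the margin are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableMarginTripletLoss(nn.Module):
    """Triplet (margin-based contrastive) loss with a learnable margin.

    The raw parameter is passed through softplus so the effective margin
    stays positive while remaining differentiable, which lets an outer
    meta-learning loop update it alongside the rest of the model.
    """
    def __init__(self, init_margin: float = 0.2):
        super().__init__()
        # Inverse-softplus initialisation so softplus(raw_margin) == init_margin.
        raw = torch.log(torch.expm1(torch.tensor(init_margin)))
        self.raw_margin = nn.Parameter(raw)

    def forward(self, anchor, positive, negative):
        margin = F.softplus(self.raw_margin)
        d_pos = F.pairwise_distance(anchor, positive)   # sketch vs. matching photo
        d_neg = F.pairwise_distance(anchor, negative)   # sketch vs. non-matching photo
        return F.relu(d_pos - d_neg + margin).mean()
```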
In this paper, we extend scene understanding to include that of human sketch. The result is a complete trilogy of scene representation from three diverse and complementary modalities -- sketch, photo, and text. Instead of learning a rigid three-way embedding and being done with it, we focus on learning a flexible joint embedding that fully supports the "optionality" that this complementarity brings. Our embedding supports optionality on two axes: (i) optionality across modalities -- use any combination of modalities as query for downstream tasks like retrieval; (ii) optionality across tasks -- simultaneously utilising the embedding for either discriminative (e.g., retrieval) or generative tasks (e.g., captioning). This provides flexibility to end-users by exploiting the best of each modality, thereby serving the very purpose behind our proposal of a trilogy in the first place. First, a combination of information-bottleneck and conditional invertible neural networks disentangles the modality-specific component from the modality-agnostic one in sketch, photo, and text. Second, the modality-agnostic instances from sketch, photo, and text are synergised using a modified cross-attention. Once learned, we show our embedding can accommodate a multitude of scene-related tasks, including those enabled for the first time by the inclusion of sketch, all without any task-specific modifications.
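As a rough illustration of the fusion step, the snippet below shows a generic cross-attention block in which one modality's tokens attend to another's; it uses PyTorch's stock nn.MultiheadAttention and does not reproduce the paper's specific modification.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Minimal cross-attention block: one modality's tokens attend to another's."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_tokens, context_tokens):
        # query_tokens:   (B, Nq, dim), e.g. modality-agnostic sketch features
        # context_tokens: (B, Nc, dim), e.g. modality-agnostic photo or text features
        fused, _ = self.attn(query_tokens, context_tokens, context_tokens)
        return self.norm(query_tokens + fused)   # residual connection + layer norm

# Usage: fuse sketch tokens with photo tokens of the same scene.
fusion = CrossAttentionFusion(dim=256, heads=4)
sketch = torch.randn(2, 16, 256)
photo = torch.randn(2, 49, 256)
out = fusion(sketch, photo)   # (2, 16, 256)
```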
We advance sketch research to scenes with FS-COCO, the first dataset of freehand scene sketches. With practical applications in mind, we collect sketches that convey scene content well yet can be sketched within a few minutes by a person with sketching skills. Our dataset comprises 10,000 freehand scene vector sketches with per-point space-time information, drawn by 100 non-expert individuals and offering both object- and scene-level abstraction. Each sketch is augmented with its text description. Using our dataset, we study for the first time the problem of fine-grained image retrieval from freehand scene sketches and sketch captions. We draw insights on: (i) scene saliency encoded in sketches via the temporal order of strokes; (ii) performance comparison of image retrieval from scene sketches versus image captions; (iii) complementarity of the information in sketches and image captions, and the potential benefit of combining the two modalities. In addition, we extend a popular LSTM-based vector sketch encoder to handle sketches of greater complexity than supported by previous work. Namely, we propose a hierarchical sketch decoder, which we leverage in a sketch-specific "pretext" task. Our dataset enables, for the first time, research on freehand scene sketch understanding and its practical applications.
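To illustrate the hierarchical treatment of complex vector sketches, below is a minimal two-level LSTM sketch encoder: a point-level LSTM per stroke followed by a stroke-level LSTM over stroke embeddings. This is an encoder-side analogue written for illustration, not the paper's hierarchical decoder, and the (dx, dy, pen_state) point format is an assumption.

```python
import torch
import torch.nn as nn

class HierarchicalSketchEncoder(nn.Module):
    """Two-level LSTM over a vector sketch.

    Level 1 runs over the (dx, dy, pen_state) points of each stroke;
    level 2 runs over the resulting stroke embeddings.
    """
    def __init__(self, point_dim: int = 3, hidden: int = 128):
        super().__init__()
        self.point_lstm = nn.LSTM(point_dim, hidden, batch_first=True)
        self.stroke_lstm = nn.LSTM(hidden, hidden, batch_first=True)

    def forward(self, strokes):
        # strokes: list of (num_points_i, 3) tensors, one per stroke.
        stroke_embs = []
        for s in strokes:
            _, (h, _) = self.point_lstm(s.unsqueeze(0))   # (1, P_i, 3) -> hidden state
            stroke_embs.append(h[-1])                     # (1, hidden)
        seq = torch.stack(stroke_embs, dim=1)             # (1, num_strokes, hidden)
        _, (h, _) = self.stroke_lstm(seq)
        return h[-1].squeeze(0)                           # sketch-level embedding

# Usage on a toy sketch with three strokes of different lengths.
enc = HierarchicalSketchEncoder()
sketch = [torch.randn(12, 3), torch.randn(30, 3), torch.randn(7, 3)]
emb = enc(sketch)   # (128,)
```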
Creative sketching, or doodling, is an expressive activity where imaginative and previously unseen depictions of everyday visual objects are drawn. Creative sketch image generation is a challenging vision problem, where the task is to generate diverse yet realistic creative sketches possessing unseen compositions of visual-world objects. Here, we propose DoodleFormer, a novel coarse-to-fine two-stage framework that decomposes creative sketch generation into the creation of a coarse sketch composition followed by the incorporation of fine details into the sketch. We introduce a graph-aware transformer encoder that effectively captures global dynamic as well as local static structural relations among different body parts. To ensure the diversity of the generated creative sketches, we introduce a probabilistic coarse sketch decoder that explicitly models the variations of each sketch body part to be drawn. Experiments are performed on two creative sketch datasets: Creative Birds and Creative Creatures. Our qualitative, quantitative, and human-based evaluations show that DoodleFormer outperforms the state of the art on both datasets, yielding realistic and diverse creative sketches. On Creative Creatures, DoodleFormer achieves an absolute gain of 25 in Fréchet Inception Distance (FID) over the state of the art. We also demonstrate the effectiveness of DoodleFormer on the related applications of creative sketch generation and sketch completion.
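One way to "explicitly model per-part variation" is to predict a Gaussian over each body part's coarse placement and sample from it, as in the hedged sketch below; the box parameterisation and layer names are illustrative assumptions, not DoodleFormer's architecture.

```python
import torch
import torch.nn as nn

class ProbabilisticPartDecoder(nn.Module):
    """Predicts a Gaussian over each body part's coarse box and samples from it.

    Sampling (rather than regressing a single box) makes the coarse composition
    stage stochastic, so repeated draws yield diverse sketch layouts.
    """
    def __init__(self, part_dim: int = 256, box_dim: int = 4):
        super().__init__()
        self.mu = nn.Linear(part_dim, box_dim)
        self.log_var = nn.Linear(part_dim, box_dim)

    def forward(self, part_features):
        # part_features: (B, num_parts, part_dim) from the transformer encoder.
        mu = self.mu(part_features)
        std = torch.exp(0.5 * self.log_var(part_features))
        eps = torch.randn_like(std)
        return mu + eps * std   # (B, num_parts, 4) sampled coarse part boxes
```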
Quadruped robots are currently used in industrial robotics as mechanical aids to automate several routine tasks. However, the use of such a robot in a domestic setting is still very much an open research question. This paper discusses the understanding and virtual simulation of such a robot capable of detecting and understanding human emotions, generating its gait, and responding via sounds and expressions on a screen. To this end, we use a combination of reinforcement learning and software engineering concepts to simulate a quadruped robot that can understand emotions, navigate through various terrains, detect sound sources, and respond to emotions using audio-visual feedback. This paper aims to establish a framework for simulating a quadruped robot that is emotionally intelligent and can primarily respond to audio-visual stimuli using motor or audio responses. Emotion detection from speech was not as performant as ERANNs or Zeta Policy learning, but still achieved an accuracy of 63.5%. The video emotion detection system produced results almost on par with the state of the art, with an accuracy of 99.66%. Owing to its "on-policy" learning process, the PPO algorithm learned extremely rapidly, allowing the simulated dog to demonstrate a remarkably seamless gait across the different cadences and variations. This enabled the quadruped robot to respond to generated stimuli, allowing us to conclude that it functions as predicted and satisfies the aim of this work.
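The abstract credits PPO's on-policy updates for the rapid gait learning; a minimal training loop with Stable-Baselines3 on a generic Gym locomotion environment is sketched below. The environment name is a placeholder, not the paper's simulated quadruped.

```python
# Minimal PPO training sketch with Stable-Baselines3; "Ant-v4" is a stand-in
# locomotion environment, not the paper's simulated quadruped.
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("Ant-v4")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)   # on-policy rollouts + clipped policy updates
model.save("quadruped_ppo")

# Roll out the learned gait.
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, _ = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
```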
Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature makes it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique properties of our approach such as gains on long-tail object queries, and the ability to perform zero-shot and few-shot NLQ.
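At its core, NaQ is a data transformation: a timestamped narration becomes a (query, temporal window) training pair for the localization model. A rough sketch of that conversion is below; the field names and the fixed-width window centred on the narration timestamp are assumptions for illustration, not the paper's exact recipe.

```python
from dataclasses import dataclass

@dataclass
class QuerySample:
    video_id: str
    query: str        # free-form text query
    start_s: float    # localized temporal window (seconds)
    end_s: float

def narration_to_query(video_id: str, narration: str, timestamp_s: float,
                       window_s: float = 4.0) -> QuerySample:
    """Turn a timestamped narration into an NLQ-style training sample.

    The narration text is reused as the query, and a fixed window centred on
    the narration timestamp stands in for the ground-truth response window.
    (Field names and the window width are illustrative assumptions.)
    """
    half = window_s / 2.0
    return QuerySample(
        video_id=video_id,
        query=narration,
        start_s=max(0.0, timestamp_s - half),
        end_s=timestamp_s + half,
    )

# Usage: convert a narration such as "#C C picks up the kettle" at t = 132.5 s.
sample = narration_to_query("vid_0007", "#C C picks up the kettle", 132.5)
```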
Machine Translation (MT) systems generally aim at automatically rendering a source language into a target language while retaining the original context, using various Natural Language Processing (NLP) techniques. Among these methods, this work focuses on Statistical Machine Translation (SMT), which uses probabilistic and statistical techniques to analyse and convert text. This paper canvasses the development of bilingual SMT models for translating English to fifteen low-resource Indian Languages (ILs) and vice versa. At the outset, all 15 languages are briefed with a short description related to our experimental need. Further, a detailed analysis of the Samanantar and OPUS datasets for model building, along with the standard benchmark dataset (Flores-200) for fine-tuning and testing, is done as part of our experiment. Different preprocessing approaches are proposed in this paper to handle the noise in the dataset. To create the system, the MOSES open-source SMT toolkit is explored. Distance reordering is utilised with the aim of capturing the rules of grammar and context-dependent adjustments through a phrase-reordering categorisation framework. In our experiment, the quality of the translation is evaluated using standard metrics such as BLEU, METEOR, and RIBES.
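As an example of the evaluation step, corpus-level BLEU can be computed with the sacrebleu library as sketched below (METEOR and RIBES come from other toolkits and are omitted); the hypothesis and reference strings are placeholders, not outputs of the paper's MOSES systems.

```python
# Corpus-level BLEU with sacrebleu; the sentences are placeholders.
import sacrebleu

hypotheses = ["the cat sat on the mat", "he reads a book"]
references = [["the cat is sitting on the mat", "he is reading a book"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```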
We introduce Argoverse 2 (AV2) - a collection of three datasets for perception and forecasting research in the self-driving domain. The annotated Sensor Dataset contains 1,000 sequences of multimodal data, encompassing high-resolution imagery from seven ring cameras, and two stereo cameras in addition to lidar point clouds, and 6-DOF map-aligned pose. Sequences contain 3D cuboid annotations for 26 object categories, all of which are sufficiently-sampled to support training and evaluation of 3D perception models. The Lidar Dataset contains 20,000 sequences of unlabeled lidar point clouds and map-aligned pose. This dataset is the largest ever collection of lidar sensor data and supports self-supervised learning and the emerging task of point cloud forecasting. Finally, the Motion Forecasting Dataset contains 250,000 scenarios mined for interesting and challenging interactions between the autonomous vehicle and other actors in each local scene. Models are tasked with the prediction of future motion for "scored actors" in each scenario and are provided with track histories that capture object location, heading, velocity, and category. In all three datasets, each scenario contains its own HD Map with 3D lane and crosswalk geometry - sourced from data captured in six distinct cities. We believe these datasets will support new and existing machine learning research problems in ways that existing datasets do not. All datasets are released under the CC BY-NC-SA 4.0 license.
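The motion forecasting scenarios pair each scored actor with a track history of location, heading, velocity, and category; a minimal, hypothetical container for that per-actor history is sketched below. It simply mirrors the fields named in the description and is not the official av2 devkit schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ActorTrack:
    """Hypothetical container for one scored actor's observed history."""
    actor_id: str
    category: str                              # e.g. "vehicle", "pedestrian"
    positions_xy: List[Tuple[float, float]]    # map-frame positions per timestep
    headings_rad: List[float]
    velocities_mps: List[Tuple[float, float]]

def last_observed_speed(track: ActorTrack) -> float:
    """Speed magnitude at the final observed timestep."""
    vx, vy = track.velocities_mps[-1]
    return (vx ** 2 + vy ** 2) ** 0.5
```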
Cashews are grown by over 3 million smallholders in more than 40 countries worldwide as a principal source of income. As the third largest cashew producer in Africa, Benin has nearly 200,000 smallholder cashew growers contributing 15% of the country's national export earnings. However, a lack of information on where and how cashew trees grow across the country hinders decision-making that could support increased cashew production and poverty alleviation. By leveraging 2.4-m Planet Basemaps and 0.5-m aerial imagery, newly developed deep learning algorithms, and large-scale ground truth datasets, we successfully produced the first national map of cashew in Benin and characterized the expansion of cashew plantations between 2015 and 2021. In particular, we developed a SpatioTemporal Classification with Attention (STCA) model to map the distribution of cashew plantations, which can fully capture texture information from discriminative time steps during a growing season. We further developed a Clustering Augmented Self-supervised Temporal Classification (CASTC) model to distinguish high-density versus low-density cashew plantations by automatic feature extraction and optimized clustering. Results show that the STCA model has an overall accuracy of 80% and the CASTC model achieved an overall accuracy of 77.9%. We found that the cashew area in Benin has doubled from 2015 to 2021 with 60% of new plantation development coming from cropland or fallow land, while encroachment of cashew plantations into protected areas has increased by 70%. Only half of cashew plantations were high-density in 2021, suggesting high potential for intensification. Our study illustrates the power of combining high-resolution remote sensing imagery and state-of-the-art deep learning algorithms to better understand tree crops in the heterogeneous smallholder landscape.
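STCA weights the discriminative time steps of the growing season; a generic temporal-attention pooling layer that captures this idea is sketched below. It is an illustration of attention over acquisition dates, not the paper's STCA architecture, and the dimensions are assumed.

```python
import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Attention-weighted pooling over a per-pixel (or per-field) time series.

    Each time step's feature gets a scalar score; the softmax over scores lets
    the model emphasise the discriminative parts of the growing season.
    """
    def __init__(self, feat_dim: int = 64):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, x):
        # x: (B, T, feat_dim) features for T acquisition dates.
        weights = torch.softmax(self.score(x), dim=1)   # (B, T, 1)
        return (weights * x).sum(dim=1)                 # (B, feat_dim)

# Usage: pool a season of 24 image dates into one feature, then classify.
pool = TemporalAttentionPool(feat_dim=64)
season = torch.randn(8, 24, 64)
pooled = pool(season)               # (8, 64)
logits = nn.Linear(64, 2)(pooled)   # cashew vs. non-cashew (illustrative head)
```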
We propose an ensemble approach to predict the labels in linear programming word problems. Entity identification and meaning representation are the two tasks to be solved in the NL4Opt competition. For the first task, we propose the ensembleCRF method to identify the named entities; our analysis found that single models did not improve on this task, so a set of prediction models each predicts the entities and their outputs are combined into a consensus result. For the second task, we present an ensemble text generator to produce the representation sentences; because the output tends to overflow, we divide the problem into multiple small tasks, have a single model generate different representations conditioned on the prompt, and combine all the generated text into an ensemble that produces the mathematical meaning of the linear programming problem.
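One simple way to form a consensus over several entity taggers is per-token majority voting over BIO label sequences, as sketched below; the tag names are illustrative, and the paper's ensembleCRF may combine models differently.

```python
from collections import Counter
from typing import List

def majority_vote_tags(model_outputs: List[List[str]]) -> List[str]:
    """Per-token majority vote over BIO tag sequences from several models.

    model_outputs[m][t] is model m's tag for token t; all sequences must be
    aligned to the same tokenisation. Ties fall back to the first model's tag.
    """
    n_tokens = len(model_outputs[0])
    consensus = []
    for t in range(n_tokens):
        votes = Counter(tags[t] for tags in model_outputs)
        top_tag, top_count = votes.most_common(1)[0]
        consensus.append(top_tag if top_count > 1 else model_outputs[0][t])
    return consensus

# Usage: three taggers labelling "minimize total shipping cost" (illustrative tags).
preds = [
    ["B-OBJ_DIR", "O", "O", "B-VAR"],
    ["B-OBJ_DIR", "O", "B-VAR", "I-VAR"],
    ["B-OBJ_DIR", "O", "O", "B-VAR"],
]
print(majority_vote_tags(preds))   # ['B-OBJ_DIR', 'O', 'O', 'B-VAR']
```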